Title: Reconsidering the Significance of Genomic Word Frequencies
Short Title: Genomic Word Frequencies

Author

  • Gregory Kucherov
Abstract

NOTICE: this is the authors' version of a work that was accepted for publication in Trends in Genetics. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication.

By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in neutrally evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law features found across all known genomes. Such a distribution may be the result of completely random evolution by a copying process. Our characterization of the entire frequency distribution of genomic words opens a way to more accurate reasoning about their over- and under-representation in genomic sequences.

Similar resources

Reconsidering the significance of genomic word frequencies.

By conventional wisdom, a feature that occurs too often or too rarely in a genome can indicate a functional element. To infer functionality from frequency, it is crucial to precisely characterize occurrences in randomly evolving DNA. We find that the frequency of oligonucleotides in a genomic sequence follows primarily a Pareto-lognormal distribution, which encapsulates lognormal and power-law ...

Genomic Signature Is Preserved in Short DNA Fragments

The recent availability of complete genomes opens a new field of research devoted to the general analysis of their global structure without regard to gene interpretation. The Chaos Game Representation of DNA sequence [2], when modified to allow for quantification, displays the whole set of frequencies of words found in a given genomic sequence under the form of images where the value of each pi...

Japanese mental syllabary and effects of mora, syllable, bi-mora and word frequencies on Japanese speech production.

The present study investigated the existence of a Japanese mental syllabary and units stored therein for speech production. Experiment 1 compared naming latencies between high and low initial mora frequencies using CVCVCV nonwords, indicating that nonwords with a high initial mora frequency were named faster than those with a low frequency initial mora. Experiments 2 and 3 clarified the possibi...

Large Deviations and Full Edgeworth Expansions for Finite Markov Chains with Applications to the Analysis of Genomic Sequences

To establish lists of words with unexpected frequencies in long sequences, for instance in a molecular biology context, one needs to quantify the exceptionality of families of word frequencies in random sequences. To this aim, we study large deviation probabilities of multidimensional word counts for Markov and hidden Markov models. More specifically, we compute local Edgeworth expansions of ar...

Growth of microbial genomes by short segmental duplications

A DNA sequence can be analyzed as a text of four letters by counting the times each word in the set of k-letter words occurs in the text. If the text is random and long enough, then the frequencies of word occurrence are expected to obey a Poisson distribution. Examination of complete microbial genomes shows that for k less than 9, the distribution has a width many times the width of a Poisson ...

Publication date: 2007